[57] ABSTRACT

A supervised procedure for obtaining weight values for 
back-propagation neural networks is described. The 
method according to the invention performs a sequence 
of partial optimizations in order to determine values for 
the network connection weights. The partial optimiza- 
tion depends on a constrained representation of hidden 
weights derived from a singular value decomposition of 
the input space as well as an Iterative Least Squares 
optimization solution for the output weights. 
ACCELERATED TRAINING APPARATUS FOR
BACK PROPAGATION NETWORKS 

ORIGIN OF THE INVENTION

The invention described herein was made by an em- 
ployee of the United States Government and may be 
manufactured and used by or for the Government of the 
United States of America for governmental purposes 
without the payment of any royalties thereon or there- 
for. 

BACKGROUND OF THE INVENTION 

The back propagation neural network, or “BPN”, is
an extremely useful neural network algorithm. The
BPN “learns” a very general class of mappings which
are usually represented as functions from $R^{n}$ to $R^{m}$.
Theoretically a 3-layer BPN can learn almost any map,
but in practice the application of the BPN has been
limited due to the enormous amount of computer time
required for the training process.

SUMMARY OF THE INVENTION 

The principal object of the present invention is to
provide a training procedure for a feed forward, back
propagation neural network which greatly accelerates
the training process.

Although the invention is, in principle, applicable to
any neural network which implements supervised learning
through error minimization or a so-called generalized
delta rule, the neural network architecture for
which the invention is best suited consists of a three-layer
feed forward network having n1 inputs, n2 hidden
units and n3 outputs. The invention contemplates that
all learning will be supervised; i.e., correct outputs are
known for all inputs in the training set.

Brief Description of the Method: 

The training method according to the invention is
applied to a feed-forward neural network having at least
two layers of nodes, with a first layer having n1 nodes
and a second layer having n2 nodes, each node of said
second layer having a weight $W2_i$, where i = 1, ..., n2.
The method comprises the steps of:

(a) applying to the network a plurality p of input
vectors for which the respective outputs are known, the
input vectors forming an input matrix

$$X = x_{i,j} \tag{1}$$

where i = 1, ..., p and j = 1, ..., n1;

(b) determining a set of n1 orthogonal singular vectors
from the input matrix X such that the standard
deviations of the projections of the input vectors along
these singular vectors, as a set, are substantially maximized,
the singular vectors each being represented as a
column of the orthogonal matrix

$$V = v_{i,j} \tag{2}$$

where i, j = 1, ..., n1, where the condition of orthogonality
is expressed by

$$v_{1,i} v_{1,j} + v_{2,i} v_{2,j} + \cdots + v_{n1,i} v_{n1,j} = 1 \text{ if } i = j \text{ and } 0 \text{ otherwise,} \tag{3}$$

and there being associated with each singular vector an
associated singular value which is a real number greater
than or equal to zero, thereby to provide an optimal
view of the input data;

(c) requiring that each weight $W2_i$ be adjusted in a
direction collinear with a particular singular vector,
where said singular vector is selected periodically in
such a way as to effect the greatest reduction in the
deviations between network outputs and the desired
training outputs; and

(d) employing a direct solution method exemplified
by, but not limited to, the Iterative Least Squares (ILS)
technique described subsequently to obtain any so-called
“output weights”, which in the instance of a 3-layer
feed-forward network would be the weight matrix W3.

Explanation of the Method:

Let the inputs for the training set be represented as a
matrix X which will be called the input matrix. The
entries in X will be denoted $x_{i,j}$ where i = 1, ..., p and
j = 1, ..., n1, with p being the number of examples which
comprise the training set. A set of orthogonal (perpendicular)
axes (called optimal axes) is extracted from the
data matrix X which provides the optimal view of the
data. The optimal axes provide a view which is “optimal”
in the sense that the projections of the input vectors
along these axes, taken as a set, have maximal standard
deviation; i.e., the optimal axes “spread out” the data
to the largest possible degree. There is a well known
mathematical procedure for computing the orthogonal
unit vectors which define the direction of these axes in
space. The unit vectors will be the right singular vectors
of the data matrix X.

The equation describing the Singular Value Decomposition
of the matrix X is

$$X = U D V^{t} \tag{4}$$

where $U = u_{i,j}$, i = 1, ..., p, j = 1, ..., n1, is the matrix
whose columns are generally known as the left singular
vectors of X, D is a square diagonal matrix of size
n1 × n1 whose diagonal elements are generally known as
the singular values of X, $V = v_{i,j}$ is the matrix whose
columns are generally known as the right singular vectors
of X, and the superscript t indicates the transpose of
the matrix V. Moreover, the columns of the matrices U
and V satisfy an orthogonality condition expressed by
the equations:

$$\sum_{k} u_{k,i}\, u_{k,j} = 1 \text{ if } i = j \text{ and } 0 \text{ otherwise, and} \tag{5}$$

$$\sum_{k} v_{k,i}\, v_{k,j} = 1 \text{ if } i = j \text{ and } 0 \text{ otherwise.} \tag{6}$$

Associated with the i-th singular vector (the i-th column of
V) is a singular value, a real number $\lambda_i$ which is greater
than or equal to zero. The significance of each of the
optimal axes is directly related to the magnitude of the
corresponding singular value. Axes defined by singular
vectors corresponding to larger singular values tend to
“spread” the projections of the data in direct proportion
to the magnitude of the singular value.

Depending on the problem, a number r of optimal
axes will be used. According to a preferred embodiment
of the invention the number r may be determined adaptively
from the data, but it can easily be selected and
supplied manually by an operator. Also in the operator-supplied
category is the number of hidden nodes n2
which are to be used. With each hidden node, in a three-layer
feed-forward neural network, there is an associated
weight vector $w2_i$, where i is the subscript of the
hidden node, i = 1, ..., n2. In virtually all conventional
networks the vector $w2_i$ can evolve in an arbitrary
manner during the training process. The present invention
allows each weight $w2_i$ to be adjusted only in a
direction parallel to one of the optimal axes defined by
the right singular vectors of X. Imposition of such a
constraint dramatically decreases the number of optimization
variables associated with the connections between
the two layers. The conventional method requires
optimization in a weight space of dimension
n1 × n2 whereas the present invention reduces this number
to n2, the number of hidden nodes. Accompanying
this reduction in the dimensionality of the optimization
is a corresponding reduction in the number of operations
required for each training cycle.

Periodically it may be necessary to pick different
singular vectors along which the hidden weights are to
be adjusted. The singular vectors themselves depend only
upon the input matrix X; therefore these vectors
need be extracted only once. The process of selecting
the optimal singular vectors to use at each hidden node
requires a number of operations less than or equal to one
conventional forward propagation cycle. Simulations
have shown excellent results even when the initial
choice of singular vectors is never disturbed. A key
discovery is that the weight vectors for many problems
may be determined very quickly with great accuracy by
only allowing the input weights to change along
1-dimensional subspaces defined by the right singular
vectors of the input matrix.
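The one-time extraction of an optimal axis can be sketched as follows. This is a minimal illustration only, not the program of the Appendix: it assumes plain power iteration on $X^{t}X$ as the extraction technique (any singular value decomposition routine could be substituted), and the function name and the small input matrix are hypothetical.

```python
import math

def leading_right_singular_vector(X, iterations=200):
    """Power iteration on X^t X; returns (unit vector v, singular value).

    Assumes the start vector is not orthogonal to the leading axis.
    """
    n1 = len(X[0])
    v = [1.0 / math.sqrt(n1)] * n1          # arbitrary unit start vector
    for _ in range(iterations):
        # proj = X v: projection of every input row onto the current axis
        proj = [sum(x_j * v_j for x_j, v_j in zip(row, v)) for row in X]
        # w = X^t (X v)
        w = [sum(row[j] * p for row, p in zip(X, proj)) for j in range(n1)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    proj = [sum(x_j * v_j for x_j, v_j in zip(row, v)) for row in X]
    sigma = math.sqrt(sum(p * p for p in proj))   # length of X v
    return v, sigma

# Hypothetical training inputs: p = 4 examples, n1 = 2 inputs,
# spread mostly along the first coordinate.
X_demo = [[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
v1, s1 = leading_right_singular_vector(X_demo)
```

Because the axis depends only on the inputs, the result can be computed once and reused for every training cycle and every hidden node.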

According to the present invention, therefore, instead
of having to solve for all components of the hidden
weight vectors, only the coefficients of the singular
vectors (one such coefficient per node as opposed to
hundreds or thousands per node in the conventional case)
must be determined in the fast learning architecture.
The determination of these coefficients can occur in any
one of several ways. The presently preferred embodiment
uses a gradient descent with a back-tracking line search.
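A gradient descent with a back-tracking line search over the coefficient vector (one coefficient per hidden node) might look like the sketch below. The sufficient-decrease (Armijo) test, the step-halving factor and the toy quadratic error are assumptions chosen for illustration; the source does not specify them, and a real network error function would take the place of `toy_error`.

```python
def backtracking_descent(f, grad, c, steps=200, shrink=0.5, armijo=1e-4):
    """Minimise f over the coefficient vector c (one entry per hidden node)."""
    c = list(c)
    for _ in range(steps):
        g = grad(c)
        g2 = sum(gi * gi for gi in g)
        if g2 < 1e-20:                      # gradient has vanished: stop
            break
        step, fc = 1.0, f(c)
        # Shrink the step until the sufficient-decrease test passes.
        while True:
            trial = [ci - step * gi for ci, gi in zip(c, g)]
            if f(trial) <= fc - armijo * step * g2 or step < 1e-12:
                break
            step *= shrink
        c = trial
    return c

# Hypothetical stand-in for the network error with n2 = 2 coefficients.
def toy_error(c):
    return (c[0] - 1.0) ** 2 + 10.0 * (c[1] + 2.0) ** 2

def toy_gradient(c):
    return [2.0 * (c[0] - 1.0), 20.0 * (c[1] + 2.0)]

c_opt = backtracking_descent(toy_error, toy_gradient, [0.0, 0.0])
```

The search space has only n2 dimensions, so each gradient evaluation and each trial step is far cheaper than in the unconstrained n1 × n2 weight space.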

In a feed-forward network with more than three
layers it would be appropriate to treat nodes in all but
the output layer as above; i.e., by imposing a constrained
weight representation. Additional layers, however,
would introduce additional processing overhead
because the optimal view axes would have to be extracted
after each change in the weights of previous
layers. The method according to the invention works
especially well for 3-layer networks because there are
no weights preceding the input weights; therefore, once
computed, the set of optimal view axes never changes
and, as will be detailed, it is easy to solve for output
weights directly using the ILS technique.

Iterative Least Squares (ILS) 

Our objective is to find the best output weight matrix
W3 for a given hidden weight matrix W2. From the
input matrix X and the hidden weight matrix we can
obtain a matrix Z of hidden neuron outputs described by

$$Z_i = \sigma_{n2}(W2\, X_i) \tag{7}$$

where $\sigma_n$ is the transfer function applied coordinate-wise, i.e.,

$$\sigma_n(\langle x_1, \ldots, x_n \rangle) = \langle \sigma(x_1), \ldots, \sigma(x_n) \rangle \tag{8}$$

where $\sigma(x)$ is any one-to-one differentiable function generally
known as the transfer function of the network, such
functions being exemplified by the so-called sigmoidal
function defined by

$$\sigma(x) = 1/(1 + \exp(-x)) \tag{9}$$

and i = 1, ..., p and $Z_i$ is the i-th row of the Z matrix. We
then must minimize the sub-function $E_0$ defined by

$$E_0(W3) = \sum_{i=1}^{p} \left(\sigma_{n3}(W3\, Z_i) - \Omega_i\right)^2 \tag{10}$$

where $\Omega_i$ is the desired output corresponding to the i-th
input in the training set and the square of the vector
quantity indicates the square of its magnitude.
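The hidden outputs and the error to be minimized can be evaluated with a few lines of code. The sketch below is illustrative only: the function names, the row-per-node layout of W2 and W3, and the tiny network sizes are assumptions, and the error is taken to be summed over all p training examples.

```python
import math

def sigmoid(x):
    """Sigmoidal transfer function 1/(1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(W, x):
    """Coordinate-wise sigmoid of the matrix-vector product W x."""
    return [sigmoid(sum(w * xj for w, xj in zip(row, x))) for row in W]

def output_error(W2, W3, X, Omega):
    """Sum over examples of the squared distance to the desired output."""
    total = 0.0
    for x, target in zip(X, Omega):
        z = layer_output(W2, x)        # hidden outputs Z_i for this example
        q = layer_output(W3, z)        # network outputs for this example
        total += sum((qj - tj) ** 2 for qj, tj in zip(q, target))
    return total

# Hypothetical network: n1 = 2 inputs, n2 = 2 hidden nodes, n3 = 1 output.
W2_demo = [[0.0, 0.0], [0.0, 0.0]]     # one row per hidden node
W3_demo = [[0.0, 0.0]]                 # one row per output node
X_demo = [[1.0, 2.0], [3.0, 4.0]]      # p = 2 training inputs
Omega_demo = [[1.0], [0.0]]            # desired outputs
E0_demo = output_error(W2_demo, W3_demo, X_demo, Omega_demo)
```

With all weights at zero every node outputs 0.5, so each example contributes 0.25 to the error in this toy case.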

Let Q be the matrix of actual outputs whose i-th row
$Q_i$ is given by the equation

$$Q_i = \sigma_{n3}(W3\, Z_i) \tag{11}$$

Note that the j-th element of $Q_i$, $q_{i,j}$, depends on W3 in
a limited way, i.e., $q_{i,j}$ only depends on the j-th row of W3.
To reiterate, the j-th column of Q is a function only of the
j-th row of W3, therefore we can solve for the rows of
W3 separately with no fear that solving for row j will
disturb the optimality of any of the other rows. Let $T_i$
be the i-th row of W3. Then the vector $T_i$ should minimize
the expression

$$E_{0,i} = \left(\sigma_p(Z\, T_i) - \Omega^i\right)^2 \tag{12}$$

where $\Omega^i$ denotes the i-th column of the output matrix $\Omega$.

There are many available techniques for solving this
equation for $T_i$ since the number of active optimization
variables n2 is relatively small. One possible approach
would be to use the well known Penrose pseudo-inverse
$Z^{+}$ of the matrix Z defined by

$$Z^{+} = V D^{-1} U^{t} \tag{13}$$

where $Z = U D V^{t}$ is the singular value decomposition of
Z and any terms involving reciprocals of zero singular
values are dropped. This technique yields

$$T_i = Z^{+} \sigma_p^{-1}(\Omega^i) \tag{14}$$

where $\sigma_n^{-1}$ is the inverse of the one-to-one mapping
$\sigma_n$ defined by

$$Y = \sigma_n^{-1}(X_1, \ldots, X_n) = (Y_1, \ldots, Y_n) \tag{15}$$

and

$$y_i = \sigma^{-1}(x_i) \tag{16}$$
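For intuition, the pseudo-inverse solution can be worked through in the simplest possible setting. The sketch below assumes a single hidden node (n2 = 1), so that Z is one non-zero column z and the pseudo-inverse reduces to $z^{t}/(z^{t}z)$; the function names and data are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(y):
    """Inverse of the sigmoid; defined only for 0 < y < 1."""
    return math.log(y / (1.0 - y))

def pinv_solution_single_column(z, omega):
    """T = Z^+ sigma^{-1}(omega) when Z consists of the single column z."""
    targets = [logit(o) for o in omega]           # pre-sigmoid targets
    return sum(zi * ti for zi, ti in zip(z, targets)) / sum(zi * zi for zi in z)

# Hypothetical hidden outputs for p = 3 examples and one output node whose
# desired values were generated from a known weight of 1.5.
z_demo = [0.2, 0.5, 0.9]
omega_demo = [sigmoid(1.5 * zi) for zi in z_demo]
T_demo = pinv_solution_single_column(z_demo, omega_demo)
```

Note that desired outputs of exactly 0 or 1 have no finite pre-sigmoid image, so this direct approach needs targets strictly inside the interval (0, 1).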

The solution given by the equation for $T_i$ does not provide
a true minimization of the function $E_{0,i}$ because the
quantity which is really being minimized is the difference
between $Z T_i$ and $\sigma_p^{-1}(\Omega^i)$ rather than the distance
between $\sigma_p(Z T_i)$ and $\Omega^i$ in the least squares sense. The
preceding fails to account for the fact that the function
$\sigma_p$ may distort the sensitivity of a post-$\sigma$ error to a
pre-$\sigma$ error; thus the above equation for $T_i$ might be
trying for a close match of relatively insensitive coordinates
which would force a mismatch in a more sensitive
variable. The sensitivity problem might be overcome by
including derivatives of $\sigma_p$ in the Z matrix. Specifically,
a new matrix Z' could be formed by multiplying
Z on the left by $\sigma_p'$, where $\sigma_p'$ is the Jacobian matrix of
the mapping $\sigma_p$. This approach has two important
disadvantages. First, there is no longer a single Z matrix to
be used to obtain all n3 rows of the W3 matrix. This
requires a singular value decomposition for each of the
different Z' matrices. Perhaps an even more serious problem
is to find reasonable values for $\sigma_p'$. In order to know $\sigma_p'(X)$,
it is necessary to know X, but X is $Z T_i$ and $T_i$ is the
weight vector for which a solution is sought. The traditional
solution to this dilemma is to iterate.

Incremental Least Squares: 

When linearization techniques are to be employed it
is desirable, perhaps essential, to have a shrinking interval
over which linearization is to occur. This requirement
of a shrinking interval or increment over which to
linearize the output transfer functions naturally gives
rise to the concept of Incremental Least Squares (ILS).
Suppose we are seeking a least squares solution to the
equation

$$G(X) = Y \tag{17}$$

i.e., $(G(X) - Y)^2$ is to be minimized, where X and Y are
vectors. Let $G'(X)$ be the Jacobian matrix of G where
the partial derivatives are evaluated at the point X. If an
initial point $X_0$ is given, then we can linearize G about
the point $X_0$ as follows

$$G(X_0 + \delta_0) \approx G(X_0) + G'(X_0)\,\delta_0 \tag{18}$$

The increment $\delta_0$ could be sought which moves G as
close as possible to the desired value Y by assuming the
linearization above and finding a least squares regression
solution for $\delta_0$. Such a solution would be

$$\delta_0 = G'^{+}(X_0)\,(Y - G(X_0)) \tag{19}$$

We could then construct a sequence $X_0, X_1, \ldots$ by

$$X_n = X_{n-1} + \delta_{n-1} \tag{20}$$

where

$$\delta_{n-1} = G'^{+}(X_{n-1})\,(Y - G(X_{n-1})) \tag{21}$$

A desirable property of such a sequence is that each
increment $\delta_n$ is the one which produces the minimum
disturbance while moving toward the solution. We
could apply this method directly to minimize the functions
$E_{0,i}$, but it would be necessary to compute the
matrix $G'^{+}(X_n)$ not only at each iteration step, but also,
as observed previously, for each output node. If we
further simplify the expression for $G'^{+}$ then only one
pseudo-inverse calculation will be required. Let the
function G be defined by

$$G(X) = \sigma_p(Z X) \tag{22}$$

As observed previously,

$$G'(X) = \sigma_p'(Z X)\, Z \tag{23}$$

If the diagonal Jacobian matrix $\sigma_p'$ is replaced by a
diagonal matrix with entries bounding those of $\sigma_p'$ from
above, then the resulting pseudo-inverse matrix provides
a conservative update; i.e., the update tends to
undershoot rather than overshoot the desired minimum.
Specifically, from the above equation for $\sigma(x)$, it follows
that the diagonal elements of $\sigma_p'$ are never greater
than 1/4 when the customary sigmoidal non-linearity is
used as the transfer function for the network. If a different
transfer function is employed, then the constant 1/4
would be replaced by an upper bound for the derivative
of $\sigma$. Combining the preceding we obtain the following
sequence $X_0, X_1, \ldots$ which approaches the optimal
output weight vector $T_i$.

$$X_n = X_{n-1} + \delta_{n-1} \tag{24}$$

where

$$\delta_{n-1} = 4 Z^{+} \left(\Omega^i - \sigma_p(Z X_{n-1})\right) \tag{25}$$

This is termed a one-step method because the major
overhead is the computation of the matrix $Z^{+}$ which
must be done only once. Though the method ignores
information which could be obtained from the transfer
function derivatives, sensitivity information is included
in the form of the errors which are passed back into the
increment $\delta_n$. The update described by the above equation
for $\delta_{n-1}$ is the Hessian update with the transfer
function first derivatives replaced by 1/4 (the upper
bound) and transfer function second derivatives replaced
by 0. The sweeping nature of the preceding
simplifications makes further theoretical treatment of
the sequence described by the equation for $\delta_{n-1}$
extremely complex. The method succeeds because most of
the important Hessian information is carried in the Z
matrix rather than in the diagonal Jacobian matrix of
transfer function first derivatives and the tensor of second
transfer function derivatives, most of whose elements
are zero.
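The one-step iteration can be tried on a small example. The following is a sketch under stated assumptions, not the program of the Appendix: the pseudo-inverse is formed as $(Z^{t}Z)^{-1}Z^{t}$, which is valid only when the columns of Z are linearly independent (a singular value decomposition would be used in general), the update multiplies by 4 because the sigmoid derivative is bounded by 1/4, and all names and data are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pinv_full_column_rank(Z):
    """Z^+ = (Z^t Z)^{-1} Z^t, valid only for independent columns of Z."""
    p, n = len(Z), len(Z[0])
    # Augmented system [Z^t Z | Z^t], reduced by Gauss-Jordan elimination.
    A = [[sum(Z[k][i] * Z[k][j] for k in range(p)) for j in range(n)]
         + [Z[k][i] for k in range(p)] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [a / A[col][col] for a in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]           # n x p matrix

def ils_output_weights(Z, omega, iterations=200, bound=0.25):
    """Iterate X_n = X_{n-1} + (1/bound) Z^+ (omega - sigma(Z X_{n-1}))."""
    Zp = pinv_full_column_rank(Z)           # computed once, reused each step
    x = [0.0] * len(Z[0])
    for _ in range(iterations):
        resid = [o - sigmoid(sum(zj * xj for zj, xj in zip(row, x)))
                 for row, o in zip(Z, omega)]
        delta = [sum(zp * r for zp, r in zip(row, resid)) / bound for row in Zp]
        x = [xi + di for xi, di in zip(x, delta)]
    return x

# Hypothetical hidden outputs (p = 4, n2 = 2) and targets generated from
# a known output weight row, so the recovered weights can be checked.
Z_demo = [[0.2, 0.9], [0.8, 0.3], [0.5, 0.5], [0.1, 0.4]]
T_true = [1.0, -2.0]
omega_demo = [sigmoid(sum(z * t for z, t in zip(row, T_true))) for row in Z_demo]
T_est = ils_output_weights(Z_demo, omega_demo)
```

Only the residual changes from step to step; the same pseudo-inverse serves every iteration and, in a full network, every output node.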

The training procedure according to the present invention
can therefore be summarized as follows:

(1) Extract singular values and singular vectors from
the input matrix X.

(2) Based on the magnitudes of the singular values,
make a judgement of how many singular vectors
must be retained.

(3) Decide how many hidden nodes to use. Note that
the results of steps (1) and (2) will in general contribute
to this determination, which may be essentially an
educated guess.

(4) Set random values for the coefficients of the singular
vectors which represent the input weights and the
full matrix of output weights.

(5) Perform a numerical optimization to find the set of
coefficients of the singular vectors which yields the
best set of input weights for the current (initially
random) output weights.

(6) Using the input weights derived from the coefficients
of singular vectors obtained in step (5), use the
ILS procedure to solve for the output weights.

(7) When no further decrease in network error can be
obtained by applying steps (5) and (6), for each of the
n2 hidden nodes, evaluate the learning potential $p_{i,j}$ of
each of the r singular vectors. The learning potential
$p_{i,j}$ of a singular vector is defined to be the absolute
magnitude of the rate of change of the network error
function with respect to changing weight $W2_i$ parallel
to the j-th singular vector, i = 1, ..., n2, j = 1, ..., r.

(8) Choose a new singular vector for each of the n2
hidden nodes according to which of the r singular
vectors possesses the greatest learning potential for
the particular hidden node, and initialize a new set of
coefficients for the new singular vectors to be all
0.0’s.

(9) Repeat steps (5) through (8) until the process stabilizes.

(10) Convert the coefficients and singular vectors
which describe the input weights into the form of
weights which are compatible with conventional
network architectures. Said conversion is accomplished
according to:

$$W2_{i,j} = \sum_{k} C_{i,k}\, V_{j,k} / \lambda_k \tag{26}$$

where $C_{i,k}$ are the singular vector coefficients obtained
in step (5), $V_{j,k}$ is the matrix V of right singular vectors
of the input matrix X, $\lambda_k$ is the k-th singular value of X
and k is allowed to range over all indices from 1 to r for
which the singular vector selection processes of steps
(7) and (8) determined that singular vector k had maximal
learning potential for node i. Although computed
by the ILS procedure rather than numerical optimization
as is the conventional method, the output weights
produced by the accelerated training method are directly
compatible with conventional network architectures.
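The conversion of step (10) amounts to a small matrix computation. The sketch below assumes the conversion formula reads as a sum over k of $C_{i,k} V_{j,k} / \lambda_k$, and it represents a singular vector that was never selected for a node by a zero coefficient; the function name and the example values are hypothetical.

```python
def coefficients_to_weights(C, V, lam):
    """Return W2 with W2[i][j] = sum over k of C[i][k] * V[j][k] / lam[k].

    C   : n2 x r coefficients (zero where vector k is unused by node i)
    V   : n1 x r matrix whose columns are right singular vectors
    lam : r singular values, all assumed non-zero
    """
    n2, n1, r = len(C), len(V), len(lam)
    return [[sum(C[i][k] * V[j][k] / lam[k] for k in range(r))
             for j in range(n1)] for i in range(n2)]

# Hypothetical example: n1 = 2 inputs, r = 2 retained axes, n2 = 3 nodes.
V_demo = [[1.0, 0.0], [0.0, 1.0]]
lam_demo = [2.0, 0.5]
C_demo = [[4.0, 0.0], [0.0, 1.0], [-2.0, 0.0]]
W2_demo = coefficients_to_weights(C_demo, V_demo, lam_demo)
```

The resulting rows are ordinary hidden weight vectors, so they can be loaded into a conventional network without any knowledge of the singular vectors.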

Although the training procedure according to the
invention is not exactly equivalent to conventional back
propagation, the weights which are produced at the end
of the procedure are entirely compatible with ordinary
back propagation networks. Simulations have shown
that, even in cases for which the subject invention fails
to produce acceptable weights, these weights can be
used as an excellent starting point for a conventional
training method. Starting a conventional method with
weights found by the accelerated method can reduce
the number of cycles required for final convergence by
a factor of 10 in many cases, and can even cause the
conventional method to converge on problems for
which convergence of the conventional methods was
never observed when the conventional method was
forced to work from a random start.

The training procedure according to the present in- 
vention is preferably employed under the following 
conditions: 

(1) The neural network is a 3-layer feed-forward network;
i.e., a network with one input layer, one hidden
layer and one output layer.

(2) The sizes of the layers are such that a significant 
amount of computation occurs in the connections 
between the input layer and hidden layer. 

Cost Comparisons:

The cost estimate $C_0$ for application of the standard
gradient descent training method for C cycles through
a data set of p examples may be calculated as follows:

$$C_0 = C\,p\,n2\,(n1 + n3) \tag{27}$$

The comparable cost $C_n$ for the training method according
to the present invention is given by:

$$C_n = S + C\,p\,\left(n2\,n3 + \xi\,n2\,(n1 + n3)\right) \tag{28}$$

where $\xi$ is the fraction of cycles which require an evaluation
of the learning potential described in (7) and (8),
and S is the cost of performing the singular value decomposition
of the input matrix X. Note that the cost of
the singular value decomposition is not multiplied by
the number of cycles because it is only necessary to
perform the singular value decomposition once at the
outset of the training process. Moreover, the singular
value decomposition need not be recomputed if outputs
or network architecture are modified.

The preferred embodiments of the present invention 
will now be described with reference to the accompa- 
nying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a representational diagram of a single artific- 
ial neuron or “node” whose output is a function of the 
input. 

FIG. 2 is a representational diagram of a back propa- 
gation neural network having three layers: an input 
layer, a hidden layer and an output layer. 

FIGS. 3a and 3b are perspective views of a hoop,
viewed nearly edge on (FIG. 3a) and viewed from the
side (FIG. 3b), which illustrate an optimal view, i.e., the
effect of viewing an object from an optimal perspective.

FIGS. 4a and 4b show a representation of a neural
network weight vector being allowed to evolve in an
unconstrained manner (FIG. 4a) and being constrained
to evolve along a preferred optimal axis only (FIG. 4b).

DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

The preferred embodiments of the present invention 
will now be described with reference to FIGS. 1-4 of 
the drawings. 

Feed forward, back propagation neural networks are 
well known in the art. Such networks comprise a plural- 
ity of artificial “neurons” or “nodes” connected in a 
highly parallel manner. The key to the functioning of 
such a “BPN” is the set of weights associated with each 
node, which vary to determine the level of association 
between nodes. It is these weights that represent the 
information stored in the system. 

A typical artificial neuron is shown in FIG. 1. The
neuron may have multiple inputs, but only one output.
The input signals to the neuron are multiplied by the
weights and summed to yield total neuron input I. For
the i-th neuron shown in FIG. 1, the neuron input I and
output O are given by:

$$I_i = \text{Neuron Input} = \sum_{j} W_{ij}\, I_j \tag{29}$$

$$O_i = \text{Neuron Output} = 1/(1 + e^{-x}) \tag{30}$$

where j identifies the source of the signal $I_j$ to the
weight $W_{ij}$. The neuron output may be a so-called “sigmoid”
function of the input:

$$1/(1 + e^{-x}).$$

The sigmoid is, in some respects, representative of
real neurons, which approach limits for very small and
very large inputs. Each neuron may have an associated
“threshold” $\theta$ which is subtracted from the total input I
so that $x = I_i - \theta$. It is customary in the art to treat these
thresholds as weights leading from an input fixed at
unity to the threshold neuron. This treatment of thresholds
allows the method according to the subject invention
to be directly applicable to neural networks with or
without thresholds.
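The equivalence between an explicit threshold and an extra weight on a constant input can be checked directly. In this sketch the function name and the numeric values are hypothetical, and the extra weight is taken to be the negative of the threshold so that the two forms agree.

```python
import math

def neuron_output(weights, inputs, threshold=0.0):
    """Sigmoid of the weighted input sum minus the threshold."""
    total = sum(w * s for w, s in zip(weights, inputs))   # total input I
    x = total - threshold
    return 1.0 / (1.0 + math.exp(-x))

# Explicit threshold of 0.3 on a two-input neuron.
with_threshold = neuron_output([0.5, -1.0], [2.0, 1.0], threshold=0.3)

# Same neuron with the threshold folded in as a third weight (-0.3)
# fed from an input fixed at unity.
as_extra_weight = neuron_output([0.5, -1.0, -0.3], [2.0, 1.0, 1.0])
```

Since both forms give the same output, a training method that adjusts only weights also covers networks with thresholds once each input vector is extended by a constant 1.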

There are several known neural net learning algo- 
rithms, such as back propagation and counter propaga- 
tion, which are used to train networks. The program- 
mer “trains” the net by supplying the input and corre- 
sponding output data to the network. The network 
learns by automatically adjusting the weights that connect
the neurons. The weights and the threshold values
of neurons determine the propagation of data through
the net and its response to the input.

FIG. 2 shows a back propagation network comprising
an input layer having four nodes (n1=4), a hidden
layer having six nodes (n2=6) and an output layer having
two nodes (n3=2). From the number of connections
in this simple network, it will be appreciated that training
the net to the correct responses is normally a computationally
expensive process. The purpose of the present
invention is to reduce this computation time and
expense.

Theoretical Basis of the Invention: 

The invention operates by performing a sequence of
partial optimizations in weight space which are of two
types. Each type of partial optimization may be viewed
as a partitioning of the network weights into two or
more classes, performing optimization on one class at a
time, and proceeding from class to class according to an
iteration strategy. The simpler partial optimization considers
the connections between the hidden and output
layers separately from those from the input to the hidden
layer. The output connections can be found by the
ILS procedure because these connections have known
outputs and inputs which are also known if the hidden
weights are assumed fixed, i.e., excluded from the partial
optimization.

The other kind of partial optimization involves decomposing
the input weight space in a manner which
provides the optimal view of the input data. This decomposition
also determines a partial optimization strategy
during which the hidden weights are constrained to
change along one-dimensional subspaces as shown in
FIG. 4b. This constraint limits the active optimization
variables during each step to a single coefficient for
each hidden node.

The optimal axes for the hidden weight space decomposition
are the right singular vectors of the input matrix X.
To illustrate this concept of optimality FIGS. 3a
and 3b show two views of a two-dimensional hoop
embedded in a space of three or more dimensions. If the
rows of the input matrix X were to contain random
samples from the hoop, then the first two right singular
vectors of X (the two corresponding to the largest singular
values) would be oriented in the plane of the
hoop. If the row vectors of X were then projected along
the axes defined by the first two singular vectors of X,
and the projections were plotted in two-dimensional
space, then the result would be the hoop laid flat and
thus most visible in the two-dimensional plane.

Advantages Over The Standard Method: 

The major advantage of the training procedure according
to the present invention is reduced training
cost. Note that the training costs given above suggest
that the accelerated training method will never be more
costly than the conventional method provided that

$$\xi < (n1 - 1)/n1.$$

Clearly lesser values of the parameter $\xi$ or greater
values of the parameter n1 indicate circumstances in
which the method according to the subject invention
should be considered.

Illustration of the Invention: 

The nature and operation of the present invention are
illustrated in FIGS. 3 and 4. FIGS. 3a and 3b show two
views of a circular hoop in space. FIG. 3a presents the
hoop nearly edge-on whereas FIG. 3b, which is the
optimal view, displays the hoop as a circle, thus providing
much more information about this device.

With the present invention, the axes defined by the 
singular vectors corresponding to the larger singular 
values tend to “spread” the projections of the data so 
that the true nature of the data becomes apparent. The 
singular vectors extracted from the inputs are thus used 
to quickly find the optimal projections or views of the 
data. 

FIG. 4a shows standard weight vectors which can 
evolve in an arbitrary direction. According to the in- 
vention, the hidden weight vectors are constrained to 
evolve through linear subspaces (FIG. 4b) which
greatly reduces the amount of computation since, in- 
stead of having to solve for all components of the hid- 
den weight vectors, only the coefficients of the singular 
vectors (one such coefficient per node as opposed to 
hundreds or thousands per node in the conventional 
case) must be determined. 

Software Implementation: A software implementa- 
tion of the present invention is set forth in the attached 
Appendix. 

Software Description: The program is written in the 
C computer language and is intended to be ported to 
IBM compatible personal computers with TURBO C, 
Berkeley UNIX work-stations as well as most computer 
systems with C language compilers. 

To Compile: 

TURBO C: 

Edit the file “flub.h”, and if necessary, change the 
definition of “TBC” to read “#define TBC 1” 

At the system command prompt type the instruction 
“tcc -mh flub.c sing_val.c” 

This instruction will cause the creation of the three 
files “flub.obj”, “sing_val.obj” and “flub.exe”. To run
the program type “flub” at the system command 
prompt. 

Berkeley UNIX Work-stations: 

Edit the file “flub.h” and if necessary change the 
definition of “TBC” to read “#define TBC 0” 

At the command line prompt type the instruction “cc 
-g flub.c sing_val.c -lm -o flub” 

This command will create the three files “flub.o”, 
“sing_val.o” and “flub”. To run the program type 
“flub” at the command prompt. 

Running the Program: 

The program only requires a file containing the input- 
/output pairs (i/o pairs) which will be used to train the 
network. This file should contain decimal numbers in 
ASCII text in the form required by the Network Execu- 
tion and Training Simulator (NETS), a product of the 
National Aeronautics and Space Administration 
(NASA), and available from COSMIC, 382 East Broad 
Street, Athens, Ga. 30602. The name of this file should 
have the extension “.iop”. A second optional file, “de- 
scribe.net” may be used to facilitate execution of the 
program. If present, this file should contain three lines 
with the following information. 

L1: Seed for pseudo random number generation (if
blank, the program will use the system clock for this purpose).

L2: Numbers of (a) inputs, (b) outputs, (c) hidden
nodes, and (d) singular vectors to use. The program will
prompt for (c) and (d) if not present in the file. Items (a)
and (b) are mandatory.

L3: The name of the “.iop” file written without the
“.iop” extension, e.g. to use the file “pattern.iop” this
line should read “pattern”.
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The program will prompt for this input if not present NETS Compatibility : 


in the file. 

At start-up the user is given the option of loading I/O 
from a binary work-file. This will generally be much 
quicker than loading the training set from an ASCII file. 
The work-file is automatically created when the ASCII 
file is processed so that you need only read the ASCII 
file once, thereafter use the work-file. 

The user communicates with the program through a 
simple command line interface. Each command is given 
by typing a letter followed by a carriage return. All but 
two of the commands are self-explanatory. These two 
commands for training and ‘d’ for choosing the “de- 
scent mode”. 

Training: 

When you select the ‘t’ option you are executing the 
steps of the Accelerated Training Method. This action 
requires two inputs from the user, counts of “major” 
and “minor” cycles. Major cycles are defined to be 
those in which the learning potential is evaluated, and 
thus are much more costly than minor cycles. The num- 
ber of minor cycles will be interpreted as the number of 
partial optimization steps between major cycles. The 
ILS solution for the output weights is done once in
every major cycle. The number ξ which governs the
relative cost of the Accelerated method in comparison
to the conventional method is simply the ratio of major
cycles to minor cycles.

ξ = (major cycle count)/(minor cycle count)
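
The bookkeeping implied by the two counts can be sketched as follows. This is an illustrative sketch rather than part of the program: `schedule_cost` and `training_schedule` are hypothetical names, and the step count is an upper bound because training may stop early.

```c
/* Hypothetical summary of the work implied by one 't' command. */
typedef struct {
    long max_minor_steps;  /* upper bound on partial optimization steps */
    long ils_solves;       /* one output-weight solution per major cycle */
    double ratio;          /* (major cycle count)/(minor cycle count) */
} schedule_cost;

schedule_cost training_schedule(int major, int minor)
{
    schedule_cost c;

    c.max_minor_steps = (long)major * minor;
    c.ils_solves = major;
    c.ratio = (minor > 0) ? (double)major / minor : 0.0;
    return c;
}
```

For example, one major cycle with four minor cycles gives a ratio of 0.25, with a single ILS solution following at most four partial optimization steps.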

Descent Selection: 

This software implementation of the Accelerated 
Training Method offers two numerical methods for 
determination of optimal coefficients for the singular 
vectors which determine the hidden weights. The gra- 
dient method uses straight gradient descent with back- 
tracking line search. The direct method uses a linearized 
Hessian method in which the sigmoids are replaced by 
locally linear mappings. The direct method is more 
costly, but will require fewer minor cycles to produce 
optimal coefficients. The cost of the direct method 
increases rapidly with n2, the number of hidden nodes, 
and thus the gradient method is usually preferable for 
large networks. 
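
The idea of a backtracking line search can be sketched in a few lines. The following is a simplified illustration and not the program's own procedure, which also enlarges the step and interpolates a best step length; `descent_step`, `objective_fn` and `sum_squares` are hypothetical names introduced only for this example.

```c
#include <stddef.h>

typedef double (*objective_fn)(const double *x, size_t n);

/* Example objective for demonstration: the sum of squares of x. */
double sum_squares(const double *x, size_t n)
{
    double s = 0.0;
    size_t i;

    for (i = 0; i < n; i++) s += x[i] * x[i];
    return s;
}

/* One steepest-descent step with backtracking: halve the step length alpha
   until the objective decreases. Returns the accepted alpha, or 0.0 if no
   decrease is found before alpha drops to min_alpha (the trial move is
   undone in that case). */
double descent_step(objective_fn f, double *x, const double *grad,
                    size_t n, double alpha, double min_alpha)
{
    double f0 = f(x, n);
    size_t i;

    while (alpha > min_alpha) {
        for (i = 0; i < n; i++) x[i] -= alpha * grad[i];  /* trial move */
        if (f(x, n) < f0) return alpha;                   /* accept */
        for (i = 0; i < n; i++) x[i] += alpha * grad[i];  /* undo */
        alpha *= 0.5;                                     /* backtrack */
    }
    return 0.0;
}
```

Each rejected trial costs one extra evaluation of the objective, which is why a good initial step length matters when each evaluation requires propagating the whole training set.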


NETS Compatibility:

As noted, the i/o files for this program are compatible
with those for NETS. If the weights generated by this
program are to be used by NETS, then the following
rules must be followed.

(1) The number of nodes in layer 0 (NETS’ input layer)
must be equal to n1 (item (a) in line 2 of “describe.net”).

(2) The number of nodes in layer 1 (NETS’ output
layer) must be equal to n3 (item (b) of line 2 in
“describe.net”).

(3) The number of nodes in layer 2 (a hidden layer for
NETS) must be equal to n2 (item (c) on line 2 of
“describe.net”).

(4) The network implemented by NETS must be fully
connected.

(5) The network implemented by NETS must have 3
layers.
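
The five rules above can be expressed as a single check. This is a sketch under stated assumptions: `nets_layout` and `nets_compatible` are hypothetical names, and NETS itself is not known to expose such a structure.

```c
/* Hypothetical description of a NETS network, for illustration only. */
typedef struct {
    int num_layers;       /* total number of layers */
    int layer_nodes[3];   /* node counts for layers 0, 1 and 2 */
    int fully_connected;  /* nonzero if the network is fully connected */
} nets_layout;

/* Returns 1 if the layout satisfies rules (1) through (5), else 0. */
int nets_compatible(const nets_layout *net, int n1, int n2, int n3)
{
    return net->num_layers == 3            /* rule 5 */
        && net->layer_nodes[0] == n1       /* rule 1: layer 0 is the input layer */
        && net->layer_nodes[1] == n3       /* rule 2: layer 1 is the output layer */
        && net->layer_nodes[2] == n2       /* rule 3: layer 2 is the hidden layer */
        && net->fully_connected;           /* rule 4 */
}
```

Note that the output layer is layer 1 and the hidden layer is layer 2, so the node counts appear in the order n1, n3, n2.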

The name of the weight file created by this program
automatically has the “.pwt” extension, and as such, is
compatible with the ‘p’ (portable) format for NETS
weight files. The weights determined by this program
are generated for a network with no thresholds (biases)
but are stored in a fashion which renders them compatible
with networks with or without thresholds. Even for
networks with no biases, NETS requires bias values to
be included in weight files. The program includes bias
values in its weight files as 0.0’s.

There has thus been shown and described a novel
training method for feed-forward, back propagation
neural networks which fulfills all the objects and advantages
sought therefor. Many changes, modifications,
variations and other uses and applications of the subject
invention will, however, become apparent to those
skilled in the art after considering this specification and
the accompanying drawings which disclose the preferred
embodiments thereof. All such changes, modifications,
variations and other uses and applications
which do not depart from the spirit and scope of the
invention are deemed to be covered by the invention,
which is to be limited only by the claims which follow.


    d = u - 2.0*v + w;
    if (d > 0.0)
        return (0.5*h*(w-4.0*v+3.0*u)/d);
    else
        if (u < w)
            return 0.0;
        else
            return 2.0*h;
} /* end best_r */

void update (r)
float r;
{
    int i;

    if (!stop_flag)
        for (i = 0; i < size_hidden; i++)
            hidden_weight_coefficient[i] -= (r*delta_c[i]);
} /* end update */

void restore (r)
float r;
{
    int i;

    if (!stop_flag)
        for (i = 0; i < size_hidden; i++)
            hidden_weight_coefficient[i] += (r*delta_c[i]);
} /* end restore */

void compute_output_weights ()
{
    int i, j, k, n, rnk;
    float s, t0, t1;
    static float *y = NULL, *z, *temp_1, *temp_2;

    if (y == NULL)
    {
        y = (float*)mem_alloc(num_pairs*sizeof(float));
        z = (float*)mem_alloc(num_pairs*sizeof(float));
        temp_1 = (float*)mem_alloc(size_hidden*sizeof(float));
        temp_2 = (float*)mem_alloc(size_hidden*sizeof(float));
    } /* end if */
    for (i = 0; i < size_hidden; i++)
        for (j = 0; j <= i; j++)
        {
            vip[i][j] = 0.0;
            for (n = 0; n < num_pairs; n++)
                vip[i][j] += (hidden_output[n][i]*hidden_output[n][j]);
            vip[j][i] = vip[i][j];
        } /* end for j */
    sing_val(size_hidden, size_hidden, vip, value, evip);
    for (rnk = 0; (rnk<size_hidden) && (value[rnk]>machine_zero); rnk++);
    for (k = 0; k < size_output; k++)
    {
        for (;;)
        {
            t0 = 0.0;
            prop_vector(size_hidden, num_pairs, weight_1[k], hidden_output, y);
            for (n = 0; n < num_pairs; n++)
            {
                y[n] = 4.0*(outpt[n][k]-sigmoid(y[n]));
                t0 += square(y[n]);
            } /* end for n */
            t0 /= 16.0;
            prop_transpose(size_hidden, num_pairs, y, hidden_output, temp_1);
            prop_transpose(size_hidden, size_hidden, temp_1, vip, temp_2);
            for (i = 0; i < rnk; i++)
                temp_2[i] /= value[i];
            prop_vector(rnk, size_hidden, temp_2, evip, temp_1);
            for (i = 0; i < size_hidden; i++)
                weight_1[k][i] += temp_1[i];
            t1 = 0.0;
            prop_vector(size_hidden, num_pairs, weight_1[k], hidden_output, y);
            for (n = 0; n < num_pairs; n++)
            {
                s = sigmoid(y[n])-outpt[n][k];
                t1 += square(s);
            } /* end for n */
            if (t1 < (num_pairs*.0000001))
                break;
            if (t1 > (t0*0.98))
            {
                for (i = 0; i < size_hidden; i++)
                    weight_1[k][i] -= temp_1[i];
                break;
            } /* end if */
        } /* end for ;; */
    } /* end for k */
} /* end compute_output_weights */

void direct_hidden ()
{
    int i, j0, j1, n;
    float d, t;
    static float *temp = NULL;

    if (temp == NULL)
        temp = (float*)mem_alloc(size_hidden*sizeof(float));
    for (j0 = 0; j0 < size_hidden; j0++)
        for (j1 = 0; j1 <= j0; j1++)
        {
            vip[j0][j1] = 0.0;
            for (n = 0; n < num_pairs; n++)
            {
                t = d_sigmoid(hidden_activation[n][j0], hidden_output[n][j0])*
                    d_sigmoid(hidden_activation[n][j1], hidden_output[n][j1])*inpt[n][j0]*inpt[n][j1];
                for (i = 0; i < size_output; i++)
                {
                    d = d_sigmoid(output_activation[n][i], net_out[n][i]);
                    vip[j0][j1] += (t*weight_1[i][j0]*weight_1[i][j1]*square(d));
                } /* end for i */
            } /* end for n */
            vip[j1][j0] = vip[j0][j1];
        } /* end for j1 */
    gradient();
    sing_val(size_hidden, size_hidden, vip, value, evip);
    for (i = 0; (i<size_hidden) && (value[i]>machine_zero); i++);
    prop_transpose(size_hidden, size_hidden, delta_c, vip, temp);
    for (j0 = 0; j0 < i; j0++)
        temp[j0] /= value[j0];
    prop_vector(i, size_hidden, temp, evip, delta_c);
} /* end direct_hidden */

void learn (c)
int c;
{
    int i, n;
    float e0, e1, e2, r1;

    propagate();
    for (i = 0; (i<c) && !stop_flag && (alpha>machine_zero); i++)
    {
        e0 = rms();
        (*descent_method)();
        update(alpha);
        propagate();
        e1 = rms();
        update(alpha);
        propagate();
        e2 = rms();
        restore(2.0*alpha);
        while (!stop_flag && (e1<e0) && (e2<e1))
        {
            e1 = e2;
            alpha *= 2.0;
            update(2.0*alpha);
            propagate();
            e2 = rms();
            restore(2.0*alpha);
        } /* end while */
        while (!stop_flag && (e1 > e0) && (alpha > machine_zero))
        {
            e2 = e1;
            alpha *= 0.5;
            update(alpha);
            propagate();
            e1 = rms();
            restore(alpha);
        } /* end while */
        r1 = best_r(alpha, e0, e1, e2);
        update(r1);
        propagate();
        alpha = r1;
    } /* end for i */
    if (i<c)
        printf(" %d passes\n", i);
    compute_output_weights();
    add_sv();
    for (n = 0; n < num_pairs; n++)
        for (i = 0; i < size_hidden; i++)
            old_hidden_activation[n][i] = hidden_activation[n][i];
    set_key(rank);
    alpha = 1.0;
} /* end learn */

void cycle (major, minor)
int major, minor;
{
    int i;
    float local_time;

    local_time = clock()/CLK_TCK;
    for (i = 0; (i<major) && !stop_flag; i++)
    {
        learn(minor);
        propagate();
        printf(" rms error = %f\n maximum error = %f\n",
            rms(), max_error());
    } /* end for i */
    local_time = clock()/CLK_TCK - local_time;
    total_time += local_time;
    total_cycles += i;
} /* end cycle */

void save_weights ()
{
    int i, j;
    float t;
    FILE *weight_file;
    coefficient_type *p;

    printf(" save weights to file <%s> ", filename);
    if (strlen(gets(str1)))
        strcpy(filename, str1);
    while ((weight_file = fopen(strcat(filename, ".pwt"), "w")) == NULL)
    {
        printf("file %s didn't open for output\n file name> ", filename);
        gets(filename);
    } /* end while */
    filename[strlen(filename)-4] = '\0';
    for (i = 0; i < size_hidden; i++)
        for (j = 0; j < size_input; j++)
        {
            t = 0.0;
            for (p = weight_matrix[i]; p != NULL; p = p->next)
                t += (p->coeff*singular_vector[j][p->svdnum]/singular_value[p->svdnum]);
            fprintf(weight_file, " %f\n", t);
        } /* end for j */
    for (i = 0; i < size_output; i++)
        for (j = 0; j < size_hidden; j++)
            fprintf(weight_file, " %f\n", weight_1[i][j]);
    for (i = 0; i < size_input+size_hidden; i++)
        fprintf(weight_file, " %f\n", 0.0);
    fclose(weight_file);
} /* end save_weights */

void teach ()
{
    int major=1, minor=5;

    printf(" press 'q' to quit\n");
    for (;;)
    {
        stop_flag = false;
        printf(" enter counts for major and minor cycles <%d %d> ", major, minor);
        sscanf(gets(str1), "%d%d", &major, &minor);
        if (str1[0] == 'q')
            break;
        cycle(major, minor);
    } /* end for ;; */
} /* end teach */

void display_time_and_cycles ()
{
    printf(" %ld cycles in %8.2f seconds\n", total_cycles, total_time);
} /* end display_time_and_cycles */

void break_handle ()
{
    signal(SIGINT, break_handle);
    stop_flag = true;
} /* end break_handle */

void change_descent_function ()
{
    if (descent_method == gradient)
    {
        printf(" Now using direct approximation\n");
        descent_method = direct_hidden;
    } /* end if */
    else
    {
        printf(" Now using gradient method\n");
        descent_method = gradient;
    } /* end else */
} /* end change_descent_function */


void menu ()
{
    printf(" c   display total cycles and training time\n");
    printf(" d   change descent method\n");
    printf(" s   Save weights\n");
    printf(" t   Train network\n");
    printf(" h   print this menu\n");
    printf(" q   quit program\n");
} /* end menu */


main ()
{
    signal(SIGINT, break_handle);
    descent_method = gradient;
    allocate_net();
    printf(" press 'h' for help\n");
    for (str1[0]='\0'; ; )
    {
        printf(" command> ");
        if (gets(str1)[0] == 'q')
            break;
        switch (str1[0])
        {
            case ('t'):
            {
                teach();
                break;
            } /* end 't' */
            case ('c'):
            {
                display_time_and_cycles();
                break;
            } /* end 'c' */
            case ('d'):
            {
                change_descent_function();
                break;
            } /* end 'd' */
            case ('s'):
            {
                save_weights();
                break;
            } /* end 's' */
            case ('h'):
            {
                menu();
                break;
            } /* end 'h' */
            default:
                break;
        } /* end switch */
    } /* end for ;; */
    fclose(description_file);
} /* end main */

/* File: flub.h -- a header file for the */
/* Fast Learning Utility for Backpropagation. */
/* by Dr. Robert O. Shelton */
/* A product of the Software Technology Branch, NASA/JSC */
/* History: The concepts leading to the development of this program were */
/* formulated by the author during the months of July and August 1990. */
/* The bulk of this code was written in the interval September 1-8 1990 */
/* Testing and further refinements were carried out from September 8 to */
/* September 18 1990 */
/* Learning-potential-based vector switching added September 19-24 */
/* Direct solution for hidden weight coefficients added */
/* September 25 - October 1 1990 */
/* Testing and minor revisions October 2 - October 19 1990 */

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <signal.h>
#include <time.h>

#define TBC 0
#if TBC
#include <alloc.h>
#define r_b 15
#define GIGANTIC huge
#else
#include <malloc.h>
#define r_b 15
#define GIGANTIC
#endif

#ifndef CLK_TCK
#define CLK_TCK 1000000.0
#endif

#define true '\01'
#define false '\0'
#define infinity 1000000.0
#define machine_zero 0.00001
#define check_break() (printf(" \b"))
#define clip(l,h,x) (((x)<(l))?(l):((x)>(h))?(h):(x))
#define min(a,b) (((a)<(b))?(a):(b))
#define square(x) ((x)*(x))
#define sigmoid(x) (1.0/(1.0+exp(-(x))))
#define d_sigmoid(x,y) ((y)*(1.0-(y)))
#define frand(x,y) ((x)+((y)-(x))*((rand()&((1l<<r_b)-1))/\
    (float)(1l<<r_b)))

typedef char string[256];

typedef struct c_t
{
    int svdnum;
    float coeff;
    struct c_t *next;
} /* end struct */
coefficient_type;

EXTERN int num_pairs, size_input, size_hidden=0, size_output,
    rand_seed=0, rank=0, *key, stop_flag;
EXTERN long total_cycles=0l;
EXTERN float alpha, total_time=0.0;
EXTERN float rand_lim_0, rand_lim_1;
EXTERN float **err_vector, *hidden_weight_coefficient, *delta_c, **hidden_err,
    GIGANTIC**inpt, GIGANTIC**outpt, **net_out, **hidden_output,
    **hidden_activation, **old_hidden_activation, **output_activation;
EXTERN float GIGANTIC*io_base, *singular_value, **singular_vector,
    **vip, *value, **evip, **weight_1, **delta_w_1;
EXTERN FILE *description_file;

EXTERN string filename, str1;
EXTERN coefficient_type **weight_matrix;
EXTERN void (*descent_method)();

FILE *in_file();
FILE *out_file();
char *mem_alloc();
char GIGANTIC*long_mem_alloc();
float dot();
float learning_potential();
float diagonal_gradient_norm();
float max_error();
float rms();
float best_r();
int get_svd();
void bubble();
void compute_svd();
void sing_val();
void print_learning_stats();
void set_key();
void add_sv();
void get_iop();
void save_svd();
void prop_vector();
void prop_transpose();
void allocate_hidden();
void allocate_net();
void propagate();
void gradient();
void update();
void restore();
void compute_output_weights();
void direct_hidden();
void learn();
void cycle();
void save_weights();
void teach();
void display_time_and_cycles();
void break_handle();
void change_descent_function();
void menu();

/* File: sing_val.c -- Auxiliary numerical support routines for FLUB */
/* by Dr. Robert O. Shelton */
/* A product of the Software Technology Branch, NASA/JSC */

#include <stdio.h>
#include <math.h>

extern char *mem_alloc();

static float at,bt,ct;
#define PYTHAG(a,b) ((at=fabs(a)) > (bt=fabs(b)) ? \
(ct=bt/at,at*sqrt(1.0+ct*ct)) : (bt ? (ct=at/bt,bt*sqrt(1.0+ct*ct)) : 0.0))

static float maxarg1,maxarg2;
#define MAX(a,b) (maxarg1=(a),maxarg2=(b),(maxarg1) > (maxarg2) ?\
    (maxarg1) : (maxarg2))

#define SIGN(a,b) ((b) >= 0.0 ? fabs(a) : -fabs(a))

void bubble (m, n, a, w, v)
int m, n;
float **a, *w, **v;
{
    int i, j, k;
    float t;

    for (i = 1; i < n; i++)
        for (j = (n-1); j >= i; j--)
            if (fabs(w[j]) < fabs(w[j+1]))
            {
                t = w[j];
                w[j] = w[j+1];
                w[j+1] = t;
                for (k = 1; k <= n; k++)
                {
                    t = v[k][j];
                    v[k][j] = v[k][j+1];
                    v[k][j+1] = t;
                } /* end for k */
                for (k = 1; k <= m; k++)
                {
                    t = a[k][j];
                    a[k][j] = a[k][j+1];
                    a[k][j+1] = t;
                } /* end for k */
            } /* end if */
} /* end bubble */

void compute_svd(m, n, a, w, v)
float **a,*w,**v;
int m,n;
{
    int flag,i,its,j,jj,k,l,nm;
    float c,f,h,s,x,y,z;
    float anorm=0.0,g=0.0,scale=0.0;
    float *rv1;

    /*
    printf("computing svd\n");
    */
    rv1=(float*)mem_alloc(n*sizeof(float));
    /* correct pointers */
    a--;
    v--;
    w--;
    for (i = 1; i <= m; i++)
        a[i]--;
    for (i = 1; i <= n; i++)
        v[i]--;
    rv1--;
    for (i=1;i<=n;i++) {
        l=i+1;
        rv1[i]=scale*g;
        g=s=scale=0.0;
        if (i <= m) {
            for (k=i;k<=m;k++) scale += fabs(a[k][i]);
            if (scale) {
                for (k=i;k<=m;k++) {
                    a[k][i] /= scale;
                    s += a[k][i]*a[k][i];
                }
                f=a[i][i];
                g = -SIGN(sqrt(s),f);
                h=f*g-s;
                a[i][i]=f-g;
                if (i != n) {
                    for (j=l;j<=n;j++) {
                        for (s=0.0,k=i;k<=m;k++) s += a[k][i]*a[k][j];
                        f=s/h;
                        for (k=i;k<=m;k++) a[k][j] += f*a[k][i];
                    }
                }
                for (k=i;k<=m;k++) a[k][i] *= scale;
            }
        }
        w[i]=scale*g;
        g=s=scale=0.0;
        if (i <= m && i != n) {
            for (k=l;k<=n;k++) scale += fabs(a[i][k]);
            if (scale) {
                for (k=l;k<=n;k++) {
                    a[i][k] /= scale;
                    s += a[i][k]*a[i][k];
                }
                f=a[i][l];
                g = -SIGN(sqrt(s),f);
                h=f*g-s;
                a[i][l]=f-g;
                for (k=l;k<=n;k++) rv1[k]=a[i][k]/h;
                if (i != m) {
                    for (j=l;j<=m;j++) {
                        for (s=0.0,k=l;k<=n;k++) s += a[j][k]*a[i][k];
                        for (k=l;k<=n;k++) a[j][k] += s*rv1[k];
                    }
                }
                for (k=l;k<=n;k++) a[i][k] *= scale;
            }
        }
        anorm=MAX(anorm,(fabs(w[i])+fabs(rv1[i])));
    }
    for (i=n;i>=1;i--) {
        if (i < n) {
            if (g) {
                for (j=l;j<=n;j++)
                    v[j][i]=(a[i][j]/a[i][l])/g;
                for (j=l;j<=n;j++) {
                    for (s=0.0,k=l;k<=n;k++) s += a[i][k]*v[k][j];
                    for (k=l;k<=n;k++) v[k][j] += s*v[k][i];
                }
            }
            for (j=l;j<=n;j++) v[i][j]=v[j][i]=0.0;
        }
        v[i][i]=1.0;
        g=rv1[i];
        l=i;
    }
    for (i=n;i>=1;i--) {
        l=i+1;
        g=w[i];
        if (i < n)
            for (j=l;j<=n;j++) a[i][j]=0.0;
        if (g) {
            g=1.0/g;
            if (i != n) {
                for (j=l;j<=n;j++) {
                    for (s=0.0,k=l;k<=m;k++) s += a[k][i]*a[k][j];
                    f=(s/a[i][i])*g;
                    for (k=i;k<=m;k++) a[k][j] += f*a[k][i];
                }
            }
            for (j=i;j<=m;j++) a[j][i] *= g;
        } else {
            for (j=i;j<=m;j++) a[j][i]=0.0;
        }
        ++a[i][i];
    }
    for (k=n;k>=1;k--) {
        for (its=1;its<=30;its++) {
            flag=1;
            for (l=k;l>=1;l--) {
                nm=l-1;
                if (fabs(rv1[l])+anorm == anorm) {
                    flag=0;
                    break;
                }
                if (fabs(w[nm])+anorm == anorm) break;
            }
            if (flag) {
                c=0.0;
                s=1.0;
                for (i=l;i<=k;i++) {
                    f=s*rv1[i];
                    if (fabs(f)+anorm != anorm) {
                        g=w[i];
                        h=PYTHAG(f,g);
                        w[i]=h;
                        h=1.0/h;
                        c=g*h;
                        s=(-f*h);
                        for (j=1;j<=m;j++) {
                            y=a[j][nm];
                            z=a[j][i];
                            a[j][nm]=y*c+z*s;
                            a[j][i]=z*c-y*s;
                        }
                    }
                }
            }
            z=w[k];
            if (l == k) {
                if (z < 0.0) {
                    w[k] = -z;
                    for (j=1;j<=n;j++) v[j][k]=(-v[j][k]);
                }
                break;
            }
            if (its == 30)
                printf(" sing_vals no convergence after 30 iterations!\n");
            x=w[l];
            nm=k-1;
            y=w[nm];
            g=rv1[nm];
            h=rv1[k];
            f=((y-z)*(y+z)+(g-h)*(g+h))/(2.0*h*y);
            g=PYTHAG(f,1.0);
            f=((x-z)*(x+z)+h*((y/(f+SIGN(g,f)))-h))/x;
            c=s=1.0;
            for (j=l;j<=nm;j++) {
                i=j+1;
                g=rv1[i];
                y=w[i];
                h=s*g;
                g=c*g;
                z=PYTHAG(f,h);
                rv1[j]=z;
                c=f/z;
                s=h/z;
                f=x*c+g*s;
                g=g*c-x*s;
                h=y*s;
                y=y*c;
                for (jj=1;jj<=n;jj++) {
                    x=v[jj][j];
                    z=v[jj][i];
                    v[jj][j]=x*c+z*s;
                    v[jj][i]=z*c-x*s;
                }
                z=PYTHAG(f,h);
                w[j]=z;
                if (z) {
                    z=1.0/z;
                    c=f*z;
                    s=h*z;
                }
                f=(c*g)+(s*y);
                x=(c*y)-(s*g);
                for (jj=1;jj<=m;jj++) {
                    y=a[jj][j];
                    z=a[jj][i];
                    a[jj][j]=y*c+z*s;
                    a[jj][i]=z*c-y*s;
                }
            }
            rv1[l]=0.0;
            rv1[k]=f;
            w[k]=x;
        }
    }
    rv1++;
    free(rv1);
    /*
    printf("sorting singular values\n");
    */
    bubble(m, n, a, w, v);
    for (i = 1; i <= m; i++)
        a[i]++;
    for (i = 1; i <= n; i++)
        v[i]++;
    w++;
} /* end compute_svd */

void sing_val (m, n, a, sigma, v)
int m, n;
float **a, *sigma, **v;
{
    int i, j;

    if (n > m)
    {
        for (i = 0; i < m; i++)
            for (j = 0; j < n; j++)
            {
                v[j][i] = a[i][j];
                a[i][j] = 0.0;
            } /* end for j */
        compute_svd(n, m, v, sigma, a);
    } /* end if */
    else
        compute_svd(m, n, a, sigma, v);
} /* end sing_val */


What is claimed is:

1. Apparatus for training a feed forward neural network
having at least two layers of nodes, with a first,
input layer having n1 nodes and a second, hidden layer
having n2 nodes, each node i of said hidden layer having
a weight vector $W2_i$, where $i = 1, \ldots, n2$, said
apparatus comprising:

(a) means for applying to the input layer successive
ones of a plurality p of input vectors, for each of
which the respective, desired output of the network
is known, said input vectors forming an input
matrix

$X = (x_{ij})$,

where $i = 1, \ldots, p$ and $j = 1, \ldots, n1$;

(b) means for determining a set of r orthogonal singular
vectors from said input matrix X such that the
standard deviations of the projections of said input
vectors along these singular vectors, as a set, are
substantially maximized, said singular vectors each
being denoted by a unit vector $V_1, \ldots, V_{n1}$, where

$V_1^2 + V_2^2 + \cdots + V_{n1}^2 = 1$,

and having an associated singular value which is a
real number greater than or equal to zero, thereby
to provide an optimal view of the input data; and

(c) means for changing the weight vector $W2_i$ of each
hidden layer node to minimize the error of the
actual network output with respect to the desired
output, while requiring during the training process
that each hidden layer weight vector only be allowed
to change in a direction parallel to one of the
singular vectors of X.

2. Apparatus of claim 1, wherein said neural network
has at least three layers of nodes, with a third output
layer having n3 nodes, each node of said third output
layer having an output weight vector $W3_i$, where
$i = 1, \ldots, n3$, said apparatus further comprising means for
determining the output weight vectors including:

(d) means for independently optimizing the output
weight vectors, there being n3 independent optimizations,
each of which determines the output
weight vector incident on each output node according
to the Incremental Least Squares (ILS)
procedure.

3. Apparatus of claim 2, further comprising means for
producing outputs at each of said first layer nodes
which are a sigmoid function of the respective inputs.

4. Apparatus of claim 2, further comprising means for
producing outputs at each of said second layer nodes
which are a sigmoid function of the respective inputs.

5. Apparatus of claim 1, further comprising means for
producing outputs at each of said first layer nodes
which are a sigmoid function of the respective inputs.

6. Apparatus of claim 1, further comprising means for
producing outputs at each of said second layer nodes
which are a sigmoid function of the respective inputs.

7. Apparatus for training a neural network composed
of nodes having differentiable one-to-one nonlinear
transfer functions such that a plurality p of input vectors
may be identified for each of which the respective,
desired output vector of the network is known, said
input vectors being represented as an input matrix

$X = (x_{ij})$,

where $i = 1, \ldots, p$, $j = 1, \ldots, n$, n being the dimensionality
of the input vectors, and said output vectors being
represented as an output matrix

$Y = (y_{ij})$,

where $i = 1, \ldots, p$, $j = 1, \ldots, m$, m being the dimensionality
of the output vectors; all nodes in the network to
which input vectors are presented being identified as
input nodes denoted as

$I_1, \ldots, I_n$,

where n is the dimensionality of the input vectors; all
nodes in the network from which output vectors are to
be extracted being identified as output nodes denoted as

$\omega_1, \ldots, \omega_m$,

where m is the dimensionality of the output vectors; and
the remaining nodes in the network being identified as
hidden nodes denoted as

$\epsilon_1, \ldots, \epsilon_{t-(n+m)}$,

where t is the total number of nodes comprising the
neural network; said apparatus comprising:

(a) means for associating with each hidden node $\epsilon_i$ a
weight vector $u_i$ representing the strength of all
synaptic connections leading to said hidden node
$\epsilon_i$, where $i = 1, \ldots, t-(n+m)$, and associating with
every output node $\omega_i$ a weight vector $v_i$ representing
the strengths of all synaptic connections leading
to said output node $\omega_i$, where $i = 1, \ldots, m$; each
hidden node $\epsilon_i$ having identified therewith a set of
optimal direction vectors denoted as $d_{ij}$, where
$i = 1, \ldots, t-(n+m)$, $j = 1, \ldots, r_i$, $r_i$ being the dimensionality
of the weight vector $u_i$ associated with
said hidden node $\epsilon_i$ and moreover being the number
of nodes from which said hidden node $\epsilon_i$ receives
inputs as well as being equal to the dimensionality
of said direction vectors $d_{ij}$, the concept of optimality
of said vector $d_{ij}$ being defined in terms of
an orthogonal direction along which the standard
deviation of the projections of the inputs are essentially
maximized, and said vectors $d_{ij}$ being obtained
as singular vectors of the input space for the
hidden node $\epsilon_i$;

(b) means for imposing a constraint on each weight
vector $u_i$ which requires said weight vector to be
aligned with a particular direction vector $d_{ij(i)}$, and
sized by a variable scalar multiplier $c_i$, said constraint
being expressed by the equation

$u_i = c_i d_{ij(i)}$,

where $i = 1, \ldots, t-(n+m)$ and the index j(i) is selected
by processes which operate by choosing a
direction vector along which changes in the
weight vector $u_i$ tend to most quickly decrease the
deviations between the actual output vectors of the
network measured at the output nodes $\omega_k$, where
$k = 1, \ldots, m$, and the desired output vectors as
represented by said output matrix Y, said deviation
being measured by processes exemplified by but
not limited to the root mean square measure of
error, said root mean square error being defined
by the equation

[equation illegible in source]

where $a_{ij}$ is the result of the propagation of input
vector i applied to all input nodes simultaneously
and the result propagated throughout the network
to each output node $\omega_j$, where $i = 1, \ldots, p$,
$j = 1, \ldots, m$;

(c) means for performing the Iterative Least Squares
solution for the weight vector $v_i$ identified with
each output node $\omega_i$, where $i = 1, \ldots, m$;

(d) means for performing a numerical optimization of
the scalar multipliers $c_i$ which determine the
weights identified with each hidden node $\epsilon_i$, where
$i = 1, \ldots, t-(n+m)$, said optimization being performed
in such a manner as to adjust the totality of
all said multipliers $c_i$ so as to reduce deviation between
the output values generated by propagating
all inputs through the network to the final output
nodes denoted $\omega_j$, $j = 1, \ldots, m$, and the desired
output values $Y_{kj}$, $k = 1, \ldots, p$, $j = 1, \ldots, m$;

(e) means for evaluating the selection of the index j(i)
associated with the direction vector $d_{ij(i)}$ at each
hidden node $\epsilon_i$, where $i = 1, \ldots, t-(n+m)$, so that
said index may be replaced by a choice consistent
with the conditions set forth in step (b) as effected
by evolution of the network through the training
process;

(f) means for reconstructing the entire set of direction
vectors $d_{ij}$ associated with hidden node $\epsilon_i$;

(g) means for performing a repetition of steps (a), . . . ,
(f) in such a manner as to effectively minimize
deviations between the actual output vectors of the
network and the desired output vectors, said deviations
being dependent upon a specific implementation,
but exemplified by the root mean square
measure of error.

8. Apparatus defined in claim 7 as applied to a layered
neural network, the nodes of which are divided into
some number K of separate classes, said node classes
defining layers of said network, there being connections
only between nodes in distinct layers; and wherein the
totality of connections between any two layers $L_i$ and
$L_j$ are completely characterized by a matrix

[matrix illegible in source]

where $1 \le i < j \le K$, $\alpha = 1, \ldots, n_j$, $\beta = 1, \ldots, n_i$, and $n_i$,
$n_j$ are the respective numbers of nodes comprising layer
i and layer j.

9. Apparatus defined in claim 7 as comprising a feed-forward
neural network, said feed-forward network
being characterized by the capability to propagate an
input through the network in only the forward direction
so that inputs to each node are dependent on only
those nodes seen to precede said node in the order of
propagation of data through the network, the graphical
realization of said feed-forward network being a directed
graph with directed edges or arcs in place of the
data flow connections of the network, and with the
direction of said arcs being that of forward propagation
of data through said data flow connections of the neural
network, and further, with said directed graph being
free of loops or cycles of any kind.

10. Apparatus defined in claim 7 comprising a 3-layer
feed-forward neural network, every hidden node $\epsilon_i$ of
said 3-layer feed-forward network receiving inputs exclusively
from input nodes $I_j$, where $i = 1, \ldots, t-(n+m)$,
$j = 1, \ldots, n$, said input nodes having values
obtained directly from said input matrix X, the input
space for said hidden node $\epsilon_i$ being completely spanned,
generated and defined by the vectors commonly referenced
as the row vectors of said input matrix X, thereby
rendering said input space, as well as all singular vectors
and singular values thereof, invariant and constant with
respect to all evolution arising from training;
wherein the weights on all connections between the
input nodes and hidden nodes are identified as the
matrix

$U = (u_{ij})$,

where $i = 1, \ldots, t-(n+m)$, $j = 1, \ldots, n$;
the weights on all connections leading to output
nodes are identified as the matrix

$W = (w_{ij})$,

where $i = 1, \ldots, m$, $j = 1, \ldots, r$, the value r being
sufficient to support such connections as are required
for the implementation, in particular, if direct
connections from input to output are to be
realized, $r = t - m$; and the inputs to all output
nodes are identified as the matrix

$Z = (z_{ij})$,

where $i = 1, \ldots, p$, $j = 1, \ldots, r$; said apparatus further
comprising:

(h) means for obtaining for each hidden node $\epsilon_i$ the
optimal set of directions $d_{ij}$ by extracting the singular
vectors from the input space of the node, said
singular vectors being substantially equivalent to
the singular vectors of the input matrix X; and

(i) means for using the Iterative Least Squares (ILS)
method to obtain an optimal set of output weights.

* * * * * 